Please note that this analysis should be re-done when more data is acquired

We start by importing the data and transforming it to extract only the answers to the questions.

raw.df <- read.csv('processed_personality_quiz_responses.csv')
colnames(raw.df) <- c('X1','X2','q1.1','q1.2','q1.3','q1.4','q1.5','q1.6','q1.7','q1.8','q1.9','q1.10',
                      'q2.1','q2.2','q2.3','q2.4','q2.5','q2.6','q2.7','q2.8','q2.9','q2.10', 'X3')
col_select1 <- c('q1.1','q1.2','q1.3','q1.4','q1.5','q1.6','q1.7','q1.8','q1.9','q1.10')
col_select2 <- c('q2.1','q2.2','q2.3','q2.4','q2.5','q2.6','q2.7','q2.8','q2.9','q2.10')
q1.df <- raw.df[,col_select1]
q2.df <- raw.df[,col_select2]
colnames(q1.df) <- c('q1','q2','q3','q4','q5','q6','q7','q8','q9','q10')
colnames(q2.df) <- c('q1','q2','q3','q4','q5','q6','q7','q8','q9','q10')
data.df <- rbind(q1.df, q2.df)
head(data.df)
##   q1 q2 q3 q4 q5 q6 q7 q8 q9 q10
## 1  0  1  0  1  2  0  1  0  2   0
## 2  0  1  1  1  1  1  1  0  0   0
## 3  0  1  0  2  1  0  1  1  2   0
## 4  0  0  1  1  1  0  0  1  2   0
## 5  1  1  0  1  1  0  1  1  2   1
## 6  1  1  1  0  2  1  0  0  0   0

Now we scale the data and perform k-means clustering with 2 classes to try and determine trust between users.

scaled_data.matrix <- scale(data.matrix(data.df))
km <- kmeans(scaled_data.matrix, centers=2)
kcluster=as.factor(km$cluster)
heatmap.2(km$centers, main='Heatmap of Cluster Centroids', cexRow=0.75, cexCol=0.75, scale="none", dendrogram="none",Colv= FALSE, Rowv=FALSE, tracecol=NA,density.info='none')

From the heatmap above, we can see that the important questions for determining trust are:

Now we can perform PCA on the scaled data:

my.pca <- prcomp(scaled_data.matrix,retx=TRUE)
head(t(summary(my.pca)$importance))
##     Standard deviation Proportion of Variance Cumulative Proportion
## PC1          1.5296176                0.23397               0.23397
## PC2          1.4096992                0.19873               0.43270
## PC3          1.2102364                0.14647               0.57917
## PC4          1.0784266                0.11630               0.69547
## PC5          1.0037484                0.10075               0.79622
## PC6          0.8112389                0.06581               0.86203

As we can see from the Cumulative Proportion, the first two principal components only explain around 43% of the variance in the data. With more data, this proportion should increase which will result in a more accurate analysis (hopefully).

Lastly, we plot the data points on an interactive biplot, which shows the data plotted with respect to PC1 and PC2 as well as having the each question with its own feature axis. The longer the feature axis line, the more influence it has.

plot1<-ggbiplot(my.pca,choices=c(1,2),
                labels=rownames(scaled_data.matrix), #show point labels
                var.axes=TRUE, # Display axes 
                ellipse = FALSE, # Don't display ellipse
                groups=kcluster,
                obs.scale=1) + # Keep original scaling
  ggtitle("Quiz Data Projected on PC1 and PC2 by Cluster")
if (out_type=="latex") {plot1} else {ggplotly(plot1)}